The skillIQ and roleIQ tests are addictive. I haven’t used Pluralsight to learn and improve my technical skills yet, but I can see how the assessments would drive subscriber interaction and frequent return visits. What a fun way to encourage personal and professional development!

Data Exploration Questions

1. Describe and visualize how the distributions of user and question rankings compare and relate between assessments.

User Ranking Distributions

Overall Ranking Metrics

Using the user_assessment_sessions dataset we can see the distribution of the various metrics for the 6678 user sessions.
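The per-assessment summary used here can be sketched in a few lines; below is a minimal pure-Python stand-in (the actual analysis used R tooling, and the column names and toy values are assumptions, not the real data):

```python
from statistics import quantiles, mean

# Toy stand-in for the user_assessment_sessions dataset (column names assumed).
sessions = [
    {"assessment": "Python", "display_score": 142},
    {"assessment": "Python", "display_score": 188},
    {"assessment": "Python", "display_score": 95},
    {"assessment": "React", "display_score": 120},
    {"assessment": "React", "display_score": 201},
    {"assessment": "React", "display_score": 160},
]

def summarize(rows, metric):
    """Per-assessment five-number-style summary of one metric."""
    by_assessment = {}
    for row in rows:
        by_assessment.setdefault(row["assessment"], []).append(row[metric])
    out = {}
    for name, values in sorted(by_assessment.items()):
        q1, q2, q3 = quantiles(values, n=4)  # quartile cut points
        out[name] = {"min": min(values), "q1": q1, "median": q2,
                     "q3": q3, "max": max(values), "mean": mean(values)}
    return out

print(summarize(sessions, "display_score"))
```

Comparing these summaries side by side per assessment is essentially what the distribution plots show.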

A comparison of the assessments is shown below. The consistency initially surprised me, but it makes sense that the distributions vary only mildly across assessments, given the need for a standardized evaluation process that is agnostic to the specific assessment.

Interaction Ranking Metrics

A look at the distributions of the user_interactions dataset again shows some variability between assessments (which may be important if I knew more about the methodology), but no more than expected.

Question Ranking Distributions

Question Ranking Metrics by Assessment

The distributions of rd and client_elapsed_time are heavily skewed, so I took the log (base 10) of client_elapsed_time and dropped rd altogether. Later (question 5) we see that the question-based rd value is 30 for the majority of the question interactions. I also removed records with outlying client_elapsed_time values: those above the 99th percentile (147260.3) and those below 0.
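That trimming-and-transforming step, sketched in pure Python (column names are assumptions; the 99th-percentile cut-off of 147260.3 comes from the full dataset):

```python
import math

# Toy question-interaction records (client_elapsed_time values and the
# column name are assumed for illustration).
interactions = [
    {"client_elapsed_time": 5200.0},
    {"client_elapsed_time": 31000.0},
    {"client_elapsed_time": 980000.0},   # above the 99th-percentile cut-off
    {"client_elapsed_time": -14.0},      # negative -> measurement artifact
]

P99 = 147260.3  # 99th percentile of client_elapsed_time in the full dataset

def clean_elapsed(rows, cutoff=P99):
    """Drop non-positive and extreme elapsed times, then move to log10 scale."""
    kept = [r["client_elapsed_time"] for r in rows
            if 0 < r["client_elapsed_time"] <= cutoff]
    return [math.log10(t) for t in kept]

logs = clean_elapsed(interactions)
print(len(logs))  # 2 of the 4 toy records survive the trimming
```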

Per the plot, the distributions of client_elapsed_time are very similar across assessments, though React is shifted slightly to the right of the other three. That shift may be meaningful given that it appears on the log scale. The distributions of the ranking metric vary in a noisy way but still follow the same general structure.

2. How does it appear the algorithm determines when a user’s assessment session is complete?

We can evaluate how the algorithm decides to stop asking questions using a time series of each assessment session. The obvious guess is a minimum threshold on the question-to-question changes in the rd value. Something very close to that guess is confirmed by observing a random sample of user_assessment_session_ids.

It’s worth checking the other metrics associated with a session (display_score, percentile, and ranking) to confirm our suspicion that rd is the main variable driving the algorithm. Per the plots below of the same three assessment sessions, rd is the only one of the four metrics that looks like an appropriate candidate.

A closer look at the distribution of the minimum rd value in each assessment’s interactions shows that a simple threshold of 80 drives the stopping rule. Over 75% of sessions stopped at an rd value just below 80. While that seems like an arbitrary value to me, I am sure empirical and theoretical studies were performed to determine the threshold. Also, 75% may seem low, but that figure includes all sessions, even those stopped prematurely by the user (as discussed in question 3).

3. Which of the assessments has the highest and lowest dropout rates, respectively?

Assuming the threshold for completing the assessment is 80, the overall dropout rate is around 24% (1608/6678 = 24.07%). The dropout rates vary substantially by assessment: React at 34.9%, Illustrator at 32.1%, Python at 21.8%, and Javascript at 21.0%.
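The per-assessment dropout calculation can be sketched as follows; this is a pure-Python illustration (the actual analysis used R), treating a session whose rd never fell below the threshold as a dropout, with column names and toy values assumed:

```python
RD_THRESHOLD = 80  # stopping threshold inferred in question 2

# Toy per-session minimum rd values (column names assumed).
sessions = [
    {"assessment": "React", "min_rd": 62.0},
    {"assessment": "React", "min_rd": 110.0},   # never reached 80 -> dropout
    {"assessment": "Python", "min_rd": 75.5},
    {"assessment": "Python", "min_rd": 79.9},
]

def dropout_rates(rows, threshold=RD_THRESHOLD):
    """Share of sessions per assessment whose rd never dropped below threshold."""
    totals, dropouts = {}, {}
    for r in rows:
        totals[r["assessment"]] = totals.get(r["assessment"], 0) + 1
        if r["min_rd"] >= threshold:
            dropouts[r["assessment"]] = dropouts.get(r["assessment"], 0) + 1
    return {a: dropouts.get(a, 0) / n for a, n in totals.items()}

print(dropout_rates(sessions))  # {'React': 0.5, 'Python': 0.0}
```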

While the plot below doesn’t quite tell the full story, it does help illustrate that it generally takes 18-20 answered questions to reach the rd threshold of 80. There is a subtle negative slope to the blue points, indicating that users who drop out early (< 10 questions) were likely answering the majority of their few questions incorrectly. A closer look, using an approach similar to what I did in question 5, might allow us to test that hypothesis.

Javascript has an unusually high number of questions answered at the top end of the display_score range that still don’t quite drive the rd value below the threshold. That may be worth examining in more detail.

4. Is there significant variance in question difficulty by topic within a given assessment?

To measure question difficulty I chose to calculate how often (in percent) a given question is answered correctly. The density plots below show the range of question difficulty for each Assessment:Topic combination. Clearly the range of question difficulty varies greatly across Assessment:Topic combinations. Some topics, like Python: Scalars and Operators and Illustrator: Transforming and Managing Objects, span nearly the entire range of values, containing both easy questions (frequently answered correctly) and difficult ones. Other topics, like React: Forms and Javascript: Exceptions, have questions that are all answered correctly about the same percent of the time. For such topics this lack of variability may make it hard to differentiate scores and rankings, compared with topics containing more variety in question difficulty.
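The difficulty measure and per-topic range can be sketched in pure Python (the actual analysis used R; column names and toy values here are assumptions):

```python
# Toy question-level interactions (column names assumed). Difficulty is
# measured as the share of interactions answered correctly per question.
interactions = [
    {"topic": "Forms",  "question_id": "q1", "correct": True},
    {"topic": "Forms",  "question_id": "q1", "correct": False},
    {"topic": "Forms",  "question_id": "q2", "correct": True},
    {"topic": "Forms",  "question_id": "q2", "correct": True},
    {"topic": "Basics", "question_id": "q3", "correct": False},
]

def pct_correct_by_question(rows):
    """Percent of interactions answered correctly, per (topic, question)."""
    counts = {}
    for r in rows:
        key = (r["topic"], r["question_id"])
        total, right = counts.get(key, (0, 0))
        counts[key] = (total + 1, right + r["correct"])
    return {k: 100.0 * right / total for k, (total, right) in counts.items()}

def difficulty_range_by_topic(rows):
    """Range (max - min) of percent-correct across each topic's questions."""
    by_topic = {}
    for (topic, _), pct in pct_correct_by_question(rows).items():
        by_topic.setdefault(topic, []).append(pct)
    return {t: max(v) - min(v) for t, v in by_topic.items()}

print(difficulty_range_by_topic(interactions))  # {'Forms': 50.0, 'Basics': 0.0}
```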

The range (max - min) of question difficulty (percent correctly answered) for each Assessment:Topic combination is listed below. I didn’t check the frequency of the questions or of the topics, which would affect both the percent correctly answered and possibly the range of difficulty.

Assessment / topic_name    range (percentage points of % correct)
Illustrator
Color 57.1
Document Setup 100.0
Drawing and Painting 100.0
Object Effects & Blending 100.0
Preparing for Print 100.0
Selection Tools 75.0
Transforming and Managing Objects 85.7
Typography 100.0
Working with Placed Graphics 44.2
Javascript
Arrays 75.0
Basics 100.0
Exceptions 21.6
Functions 41.7
Object Oriented JavaScript 52.3
Objects 56.0
Operators 58.1
Statements 2.8
Types 100.0
Unknown
NA 100.0
Python
Collections 100.0
Correctness 69.0
Development Environment 76.9
Functions 100.0
Modules 100.0
Objects 66.7
Scalars & Operators 80.0
Strings & IO 100.0
Syntax 100.0
React
Components 68.6
Events and Binding 33.5
Forms 8.5
JSX 30.1
Lifecycle 9.4
Performance 39.7
Props 18.1
State 22.7
Styling 9.7
Testing 25.2

5. How many times must a question be answered before it reaches its certainty floor? Does that number appear to be constant or does it vary depending on question or assessment?

There are 724 questions in the dataset. I expect the rd metric to again indicate the certainty floor. A quick look at the distribution of rd values shows that floor to be 30. However, many (71.1%) of the assessment_item_ids show all of their rd values equal to 30. Perhaps those are older questions that reached the floor (30) before the window covered by this dataset.
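The "already at the floor" check can be sketched like this (pure-Python illustration; column names, the floor of 30, and the toy values are assumptions):

```python
RD_FLOOR = 30  # certainty floor observed in the rd distribution

# Toy rd observations per question (column names assumed).
rd_records = [
    {"assessment_item_id": "a", "rd": 30}, {"assessment_item_id": "a", "rd": 30},
    {"assessment_item_id": "b", "rd": 65}, {"assessment_item_id": "b", "rd": 30},
    {"assessment_item_id": "c", "rd": 30},
]

def share_already_at_floor(rows, floor=RD_FLOOR):
    """Fraction of questions whose every observed rd equals the floor."""
    by_question = {}
    for r in rows:
        by_question.setdefault(r["assessment_item_id"], []).append(r["rd"])
    at_floor = sum(all(v == floor for v in vals) for vals in by_question.values())
    return at_floor / len(by_question)

print(share_already_at_floor(rd_records))  # 2 of 3 toy questions -> ~0.667
```

On the real data this fraction is the 71.1% quoted above.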

We would really like to look at all 724 of these questions. We can examine much of the structure using trelliscopejs, a tool for interactively viewing a large collection of visualizations. The key advantage of trelliscope is that it allows creation of a rich feature set that can then be used to sort and filter through the data, helping us see nuances, outliers, and important features.

A brief description of the cognostics (features) is available by clicking on the “i” in the upper left corner. You can search for interesting assessment_item_ids using the Sort and Filter buttons on the left-hand side:

  • To see the assessment_item_ids with rd values other than 30, click the Filter button, then the “All RD values = 30” pill, and enter “0” on the right side. This reduces the total number of panels from 724 to 209.
  • To see only panels (plots) with at least two points present (and thus a plot drawn), stay in the Filter view, click the “Number of Question Interactions” pill, and enter 2 on the left-hand side of the range selection. This immediately removes the blank single-observation panels and reduces the panel count from 209 to 180.
  • Clicking the Filter button again closes that window.

You can sort or filter further to test hypotheses or explore the data sliced by assessment_item_id. Happy exploring!

Also note that the plotting panel function can be ggplot- or rbokeh-based. Here I used rbokeh, so some interactivity exists within each plot as well.

Obviously an rd value of 30 is important and relevant, but I didn’t find anything else that gave me sufficient confidence to answer this question explicitly, beyond saying there appears to be plenty of variation between questions.

Maybe I should have used a similar approach (trelliscope) to examine the various assessments in the previous questions.


More Involved/Open-ended Questions

1. Identify a metric that could be used to identify questions that are performing poorly, and consequently might need to be reviewed, changed, or removed.

  • Questions that are almost always answered incorrectly, especially when the question’s difficulty is comparatively low. (Some questions are likely purposefully difficult, so one expects those to rarely receive a correct response.)
  • Questions that substantially increase the rd metric (though that may be a function of question order).
  • A scatterplot comparing the rd change attributable to a question against the user’s current percentile could identify outliers: cases where the rd change is large and negative while the percentile was low.
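The bullet points above can be folded into one screening rule; a rough pure-Python sketch (column names, weights, and cut-offs are all assumptions, not the real scoring):

```python
# Toy per-question aggregates (all column names and thresholds are assumed).
questions = [
    {"id": "q1", "pct_correct": 3.0,  "mean_rd_change": -1.2},  # almost never right
    {"id": "q2", "pct_correct": 55.0, "mean_rd_change": -2.0},
    {"id": "q3", "pct_correct": 48.0, "mean_rd_change": 9.5},   # inflates uncertainty
]

def flag_for_review(rows, max_pct_correct=5.0, max_rd_increase=5.0):
    """Flag questions nearly always answered incorrectly, or questions whose
    presence tends to *increase* rd on average (adding uncertainty)."""
    flagged = []
    for q in rows:
        if (q["pct_correct"] <= max_pct_correct
                or q["mean_rd_change"] >= max_rd_increase):
            flagged.append(q["id"])
    return flagged

print(flag_for_review(questions))  # ['q1', 'q3']
```

In practice the cut-offs would be calibrated against the observed distributions rather than hard-coded.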

2. Suppose an update to Python causes a question’s answer to change, but our question authors don’t notice, and the now-outdated question remains in the test. How might that scenario reveal itself in the data?

Hopefully it reveals itself as a question that is frequently answered incorrectly. That may not hold for more experienced or long-time users of that technology/language, so one might need to account for that somehow. I noticed a link at the bottom of the page, after the answer is revealed, that provides an opportunity for a situation like this to be identified.

3. Given your response to number 2 in the Data Exploration Questions above, what is a method we could use to determine ideal points to stop a user’s assessment session (i.e. identify the right balance between certainty and burden on the user)?

You could compare the display_score (and ranking) to the changes/decreases in rd as a session approaches the rd threshold. Rules or a model might let you occasionally shave off 3-5 questions per assessment, if you sufficiently understood the relationship between display_score and rd at the late stage of the assessment.

I suppose you could try to account for the distribution/curve of previous assessments of that user. For example, if they have taken several assessments before the current assessment you may be able to predict/extrapolate the end score and ranking based on their position part way through the assessment.

Taking that a wild step further, why not treat each step of an assessment (for a given topic) as a modeling and prediction opportunity by developing a deep learning model trained on the eventual outcome of the assessment? That way you could use the thousands (or millions) of assessments for that topic to generate a prediction, stopping the assessment once the prediction reaches a certain accuracy threshold per the model. To be clear, I am thinking of a different deep learning model (or potentially any predictive model) for each number of questions answered: one model based on five questions answered, then a model based on six questions answered, and so on.
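That one-model-per-question-count scheme can be sketched without committing to any particular learning algorithm; here a trivial mean-of-similar-histories "model" stands in for the deep learning model, and all data and names are illustrative:

```python
# One predictive model per number-of-questions-answered.
# Historical sessions: running score after each question, plus the final score.
history = [
    {"scores": [50, 55, 60, 64, 66, 67], "final": 68},
    {"scores": [50, 48, 47, 46, 45, 44], "final": 43},
    {"scores": [50, 56, 61, 65, 67, 68], "final": 69},
]

def fit_model_for_step(sessions, k):
    """'Train' the step-k model: here simply the mean final score of the
    historical sessions whose running score after k questions was within
    5 points of the current user's."""
    def predict(score_after_k):
        close = [s["final"] for s in sessions
                 if abs(s["scores"][k - 1] - score_after_k) <= 5]
        return sum(close) / len(close) if close else None
    return predict

# One model per step, as described above: step 5, step 6, and so on.
models = {k: fit_model_for_step(history, k) for k in (5, 6)}
print(models[5](66))  # mean final score of the similar step-5 histories
```

A real system would replace the lookup with a fitted model per step and stop the assessment once successive predictions stabilize within some tolerance.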

4. How could we calculate the overall difficulty level of a particular topic? How might we then calculate a topic-level score for a single user?

You may get close by determining which combinations of topics tend to be taken by the same users. If a set of users is prone to take the same five topic assessments (and rarely others), you could look at which topic was the most difficult for that group. As an example, business analysts may consistently take the data warehousing, data analytics/visualization, SQL, and Python assessments, and often score lower on the Python assessment.

I wonder if the frequency with which a topic is assessed is an indicator of its difficulty. Certainly frequency relates to the popularity and general demand/usefulness of the topic, as well as its newness (newer tools/tech/languages may be taken less frequently, following an adoption curve). Fortran or other older languages/technologies may be considered more difficult simply because fewer modern learning resources exist for them.

How is “difficult” defined here?